Measuring Machine Intelligence Through Visual Question Answering

Authors

  • C. Lawrence Zitnick
  • Aishwarya Agrawal
  • Stanislaw Antol
  • Margaret Mitchell
  • Dhruv Batra
  • Devi Parikh
Abstract

The visual question-answering task requires a variety of skills. The machine must be able to understand the image, interpret the question, and reason about the answer. Many researchers exploring AI may not be interested in the low-level tasks involved in perception and computer vision. Many of the questions may even be impossible to solve given the current capabilities of state-of-the-art computer vision algorithms. For instance, the question "How many cellphones are in the image?" may not be answerable if the computer vision algorithms cannot accurately detect cellphones. In fact, even for state-of-the-art algorithms many objects are difficult to detect, especially small objects (Lin et al. 2014).

To enable multiple avenues for researching VQA, we introduce abstract scenes into the data set (Antol, Zitnick, and Parikh 2014; Zitnick and Parikh 2013; Zitnick, Parikh, and Vanderwende 2013; Zitnick, Vedantam, and Parikh 2015). Abstract scenes, or cartoon images, are created from sets of clip art (figure 7). The scenes are created by human subjects using a graphical user interface that allows them to arrange a wide variety of objects. For clip art depicting humans, the poses and expressions may also be changed. Using the interface, a wide variety of scenes can be created, including ordinary scenes, scary scenes, or funny scenes. Since the type of clip art and its properties are exactly known, the problem of recognizing objects and their attributes is greatly simplified. This gives researchers an opportunity to study the problems of question understanding and answering more directly. Once computer vision algorithms catch up, perhaps some of the techniques developed for abstract scenes can be applied to real images.
The abstract scenes may be useful for a variety of other tasks as well, such as learning commonsense knowledge (Zitnick, Parikh, and Vanderwende 2013; Antol, Zitnick, and Parikh 2014; Chen, Shrivastava, and Gupta 2013; Divvala, Farhadi, and Guestrin 2014; Vedantam et al. 2015).
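Because the clip-art type and properties of every object in an abstract scene are exactly known, a question like the cellphone example above reduces to a lookup over structured scene data rather than a perception problem. A minimal sketch of that idea, with a hypothetical `ClipArtObject` representation (the names and fields are illustrative assumptions, not the data set's actual schema):

```python
# Hypothetical sketch: an abstract scene is available as structured
# clip-art metadata rather than pixels, so a counting question can be
# answered without any object detection.

from dataclasses import dataclass


@dataclass
class ClipArtObject:
    category: str           # e.g. "dog", "table", "cellphone"
    x: float                # position of the clip art in the scene
    y: float
    attributes: tuple = ()  # e.g. ("smiling",) for human clip art


def count_category(scene, category):
    """Answer 'How many <category>s are in the image?' by direct lookup."""
    return sum(1 for obj in scene if obj.category == category)


scene = [
    ClipArtObject("cellphone", 0.2, 0.5),
    ClipArtObject("cellphone", 0.7, 0.4),
    ClipArtObject("dog", 0.5, 0.8),
]
print(count_category(scene, "cellphone"))  # → 2
```

The same structured representation is what makes abstract scenes attractive for the commonsense-learning work cited above: relations between objects can be read off directly instead of being estimated by a vision system.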

Similar references

FVQA: Fact-based Visual Question Answering

Visual Question Answering (VQA) has attracted much attention in both computer vision and natural language processing communities, not least because it offers insight into the relationships between two important sources of information. Current datasets, and the models built upon them, have focused on questions which are answerable by direct analysis of the question and image alone. The set of su...


Image Captioning and Visual Question Answering Based on Attributes and External Knowledge.

Much of the recent progress in Vision-to-Language problems has been achieved through a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This approach does not explicitly represent high-level semantic concepts, but rather seeks to progress directly from image features to text. In this paper we first propose a method of incorporating high-level concepts in...


Multimodal Machine Learning: Integrating Language, Vision and Speech

Multimodal machine learning is a vibrant multi-disciplinary research field which addresses some of the original goals of artificial intelligence by integrating and modeling multiple communicative modalities, including linguistic, acoustic and visual messages. With the initial research on audio-visual speech recognition and more recently with language & vision projects such as image and video ca...


Pororobot: Child Tutoring Robot for English Education

The recent success of machine learning has led to advancements in robot intelligence and human-robot interaction. It is reported that robots can understand visual scene information well and describe scenes in language using computer vision and natural language processing methods. Image Question-Answering (QA) systems can be used for human-robot interaction. However, to achieve human-level ...


Visual Question Answering using Deep Learning

Multimodal learning between images and language has gained the attention of researchers over the past few years. Using recent deep learning techniques, specifically end-to-end trainable artificial neural networks, performance in tasks like automatic image captioning and bidirectional sentence and image retrieval has been significantly improved. Recently, as a further exploration of present artificial...



Journal:
  • AI Magazine

Volume 37, Issue 

Pages  -

Publication date: 2016